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Applying standard Markov chain Monte Carlo (MCMC) algorithms to large data sets is 
computationally infeasible. The recently proposed stochastic gradient Langevin dynamics 
(SGLD) method circumvents this problem in three ways: it generates proposed moves using 
only a subset of the data, it skips the Metropolis-Hastings accept-reject step, and it uses 
sequences of decreasing step sizes. In |Teh et ah, 2014| , we provided the mathematical foun¬ 
dations for the decreasing step size SGLD, including consistency and a central limit theorem. 
However, in practice the SGLD is run for a relatively small number of iterations, and its step 
size is not decreased to zero. The present article investigates the behaviour of the SGLD 
with fixed step size. In particular we characterise the asymptotic bias explicitly, along with 
its dependence on the step size and the variance of the stochastic gradient. On that basis a 
modified SGLD which removes the asymptotic bias due to the variance of the stochastic gra¬ 
dients up to first order in the step size is derived. Moreover, we are able to obtain bounds on 
the finite-time bias, variance and mean squared error (MSE). The theory is illustrated with 
a Gaussian toy model for which the bias and the MSE for the estimation of moments can 
be obtained explicitly. For this toy model we study the gain of the SGLD over the standard 
Euler method in the limit of large data sets. 
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1 Introduction 


A standard approach to estimating expectations under a given target density tt ( 9 ) is to 
construct and simulate from Markov chains whose equilibrium distributions are designed 
to be 7r Brooks et al., 2011b . A well-studied approach, for example in molecular dynam¬ 


ics [Leirn kuhler and Matthews, 2013;, Bou-Rabee and Vanden-Eijnden, 20l0| and throughout 
Bayesian statistics Milstein and Tretyakov, 2007, Ne al, 20TT| , is to use Markov chains con¬ 
structed as numerical schemes which approximate the time dynamics of stochastic differential 
equations (SDEs). In this paper we will focus on the case of first order Langevin dynamics, 


* vollmer@stats. ox. ac.uk 
' k.zygalakis@soton.ac.uk 
U-w.teh@stats.ox.ac.uk 


1 

















which has the form 


( 1 ) 


dO(t) = -Vlog7r (9(t))dt + dW t , 

where t E K+, 9 E JR 6 * and Wt is a d-dimensional standard Brownian motion. Under appro¬ 
priate assumptions on it(9), it is possible to show that the dynamics generated by Equation 
§ are ergodic with respect to tt(0). 

The simplest possible numerical scheme for approximating Equation ([Tj) is the Euler- 
Maruyama method. Let h > 0 be a step size. Abusing notation, the diffusion 9{k ■ h ) at time 
k ■ h is approximated by Ok, which is obtained using the following recursion equation 


9k +1 = Ok + log 7r(0fc) + Vh£k, 


( 2 ) 


where is a standard Gaussian random variable on M rf . One can use the numerical tra¬ 
jectories generated by this scheme for the construction of an empirical measure iTh{0) either 
by averaging over one single long trajectory or by averaging over many realisations in order 
to obtain a finite ensemble average (see for example |Milstein and Tretyakov, 2007 ). How¬ 
ever, as discussed in Robert s and Twe edie, 1996j, one needs to be careful when doing this 
as it could be the case that the discrete Markov chain generated by Equation ([2]) is not 
ergodic. But even if the resulting Markov Chain is ergodic, vr^(0) will not be equal to 7 r(0) 
Mattingl y et al., 2010} |Abdulle et al., 2014| which thus implies that the resulting sample av¬ 
erage is biased. An alternative strategy that avoids this discretization bias and the ergodicity 
of the numerical procedure, is to use Equation ([2]) as a proposal for a Metropolis-Hastings 
(MH) MCMC algorithm Brooks et al., 2011a], with an additional accept-reject step which 
corrects the discretization error. 

In this paper we are interested in situations where it arises as the posterior in a Bayesian 
inference problem with prior density 7To(0) and a large number N 3> 1 of i.i.d. observations 
Xj with likelihoods ir{Xi\9). In this case, we can write 


N 


ir(9) oc 7 t 0 (6>) Jj7rpG|0), 


(3) 


2=1 


and we have the following gradient, 

Vlogvr(0) = V log 7To(9) + £ Vlog7r(Xi|0). 


N 


(4) 


2=1 


In these situations each update ([2]) has an unpractically high computational cost of 0(N ) 
since it involves computations on all N items in the dataset. Likewise, each MH accept-reject 
step is impractically expensive. 

In contrast, the recently proposed stochastic gradient Langevin dynamics (SGLD) [Wi lling and Teh, 2011] 
circumvents this problem by generating proposals which are only based on a subset of the 
data, by skipping the accept-reject step and by using a decreasing step-size sequence ( h k )k>o■ 

In particular one has 


9k +1 = 0 k + ^Vlogvr^fc) + 'Jhkik, 

___ /y n 

V log ir{9k) = Vlog7r 0 (6> fc ) + — ^ V logvrpG-J 0 k ) 


(5) 

( 6 ) 


2=1 
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where are independent standard Gaussian random variables on M rf , and 77 . = (r k 1 , • ■ ■ , r kn ) 
is a random subset of [N] := {1, • • • , N} of size n, generated, for example, by sampling with 
or without replacement from [N]. The idea behind this algorithm is that, since the stochastic 
gradient appearing in Equation ([ 5 ]) is an unbiased estimator of the true gradient Vlog7r(0), 
the additional perturbation due to the gradient stochasticity is of order h, smaller than the 
Vh order of the injected noise, and so the limiting dynamics (k —»• 00) of Equation ([ 5 ]) should 
behave similarly to the case n = N. In |Teh et ah, 20141 it was shown that in this case that 
the hi-step size weighted sample average is consistent and satisfies a CLT with rate depending 
on the decay of h k . The optimal rate is limited to K~ 3 and achieved by an asymptotic step 
size decay of X K~ 3 . 

The problem with decaying step sizes is that the efficiency of the algorithm slows the longer 
it is run for. A common practice for the SGLD and its extensions, the Stochastic Gradient 


Hamiltonian Monte Carlo Chen et ah, 20141 and the Stochastic Gradient Thermostat Monte 


Carlo algorithm Ding_et ah, 2 014| , is to use step sizes that are only decreasing up to a point. 
The primary aim of this paper is to analyse the behaviour of SGLD with fixed step sizes of 
hk = h. We provide two complementary analyses in this setting, one asymptotic in nature 
and one finite time. Let : l 6 * -> R be a test function whose expectation we are interested 
in estimating. Using simulations of the dynamics governed by Equation ([ 5 ]), we can estimate 
the expectation using 


E, 


w)]«l £>(<y 


(7) 


k=1 


for some large number of steps K. Our analyses shed light on the behaviour of this estimator. 

In the first analysis, we are interested in the asymptotic bias of the estimator ([t]) 

K — > 00, 

K 


as 


lim '<K4)-^[<£(0)]. 


K—too K 


( 8 ) 


k =1 


Assuming for the moment that the dynamics governed by Equation ([ 5 ]) is ergodic, with invari¬ 
ant measure 7 Xh{9\n), the above asymptotic bias simply becomes E T/i (..„) [<t>(9)\ — E n [(j)(9)\. In 
the case of Euler-Maruyama, where n = N and the gradient is computed exactly, the asymp¬ 
totic behaviour of the dynamics is well understood, in particular its asymptotic bias is O(h) 
Matti ngly et ah, 2010| . When n < N, using the recent generalisations Abdulle et al., 201 4[ 
|Sato and Nakagawa, 2014] of the approach by |Talay and Tubaro, 199 0j reviewed in Section 
[3j we are able to derive an expansion of the asymptotic bias in powers of the step size h. This 
allows us to explicitly identify the effect, on the leading order term in the asymptotic bias, 
of replacing the true gradient Q with the unbiased estimator ([6]). In particular, we show in 
Section [4] that, relative to Euler-Maruyama, the leading term contains an additional factor 
related to the covariance of the subsampled gradient estimators ([6]). 

Based on this result, in Section 4.2 we propose a modification of the SGLD (referred to 
simply as mSGLD) which has the same asymptotic bias as the Euler-Maruyama method up 
to first order in h. The mSGLD is given by 


9 k+ i =9 k + ^Vlog7r(6» fe ) + Vh (i - ^Cov Vlog7r(0A : ) 


0 


(9) 


where Gov 


Vlog7r(0fc) is the covariance of the gradient estimator. When the covariance is 
unknown, it can in turn be estimated by subsampling as well. This modification is different 
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from the Stochastic Gradient Fisher Scoring |S. Ahn and Welling, 2012 , a modification of 
the injected noise in order to better match the Bernstein von Mises posterior. In contrast, 
the mSGLD is a local modification based on the estimated variance of the stochastic gradient. 

The second contribution provides a complementary finite time analysis. In the finite time 
case both the bias and the variance of the estimator are non-negligible, and our analysis 
accounts for both by focussing on bounding the mean squared error (MSE) of the estimator 
Q. Our results, presented in Section [5j show that, 


E 


K -1 


K 


Y 4>(8k) - 7rO) 


k=0 


< C(n) ( h 2 + — 


( 10 ) 


where the RHS only depends on n through the constant C(n), the h 2 term is a contribution 
of the (square of the) bias while the 1 /Kh term is a contribution of the variance. We see 
that there is a bias-variance trade-off, with bias increasing and variance decreasing monoton- 
ically with h. Intuitively, with larger h the Markov chain can converge faster (lower variance) 
with the same number of steps, but this incurs higher discretization error. Our result is 
achieved by extending the work of Mattingly et ah, 2010 from to M rf . The main diffi¬ 
culty in achieving this relates to combining the results of |Pardoux and Veretennikov, 2001 
and |Teh et al., 2014], in order to establish the existence of nice, well controlled solutions 
to the corresponding Poisson equation Mattingly et ah, 20101. We can minimise Equation 

(10) over h, finding that the minimizing h is on the order of K~ 3, and yields an MSE of 
- - 2 . ... . _ 1 

order K 3. This agrees, surprisingly, with the scaling of K 3 for the central limit the¬ 
orem established for the case of decreasing step sizes, for the Euler-Maruyama scheme in 
|Lamberton and Pages, 2002 and for SGLD in |Teh et ah, 2014| . This unexpected result, 
that the decreasing step size and fixed step size discretisations have, up to a constant, the 
same efficiency seems to be missing from the literature. 

Our theoretical findings are confirmed by numerical simulations. More precisely, we start 
by studying a one dimensional Gaussian toy model both in terms of the asymptotic bias and 
the MSE of time averages in Section [2] The simplicity of this model allows us to obtain 
explicit expressions for these quantities and thus illustrate in a clear way the connection with 
the theory. More precisely, we confirm that the scaling of the step size and the number of steps 


for a prescribed MSE obtained from the upper bound in Equation (10) matches the scaling 


derived from the analytic expression for the MSE for this toy model. More importantly, this 
simplicity allows us to make significant analytic progress in the study of the asymptotic bias 
and MSE of time averages in the limit of large data sets N —> oo. In particular, we are able 
to show that the SGLD reduces the computational complexity by one order of magnitude 
in N. for the estimation of the second moment in comparison with the Euler method if the 
error is quantified through the MSE. 

In summary, this paper is organised as follows. We present our first explorations of the 
SGLD applied to a one-dimensional Gaussian toy model in Section [2] For this model we 
obtain an analytic characterisation of its bias and variance. This serves as intuition and 
benchmark for the two analyses developed in Sections i to [5j In Section [3] we review some 
known results about the effect of the numerical discretisation of Equation (jTj) in terms of 
the finite time weak error as well as in terms of the invariant measure approximation. In 
Section [4] we apply these results to analyse the finite and long time properties of the SGLD, 
as well as to construct the modified SGLD algorithm which, asymptotically in h, behaves 
exactly as the Euler-Maruyama method (n = N ) while still sub-sampling the data set at 
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each step. Furthermore, in Section [5] we discuss the properties of the finite time sample 
averages, include its MSE. In Section [6] we revisit the Gaussian toy model to obtain a more 
precise understanding of the behaviour of SGLD in a large data and high accuracy regime. 
This is achieved using analytic expressions of the expectations of the sample average which are 
obtained using the supplemented Mathematica® notebook described in detail in Appendix 


logistic regression model which matches the theory, while, we conclude this paper in Section 
[8] with a discussion on some possible extensions of this work. 


10 Finally, in Sections] we demonstrate the observed performance of SGLD for a Bayesian 


2 Exploring a one-dimensional Gaussian Toy Model 


In this section, we develop results for a simple toy model, which will serve as a benchmark 
for the theory developed in Sections [3] to [5} In particular, we obtain analytic expressions for 
the bias and the variance of the sample average of the SGLD, allowing us to characterise its 
performance in detail. 

We consider a one-dimensional linear Gaussian model, 


9 ~ AA(O,cr0), 

Xi I e im ~- Af(9, a*) for i = 1,..., N. 


The posterior is given by 


S£i * (L,* 

+ N\°e a x 


vr = a r(n P ,cTp) = a r 




-i 


For this choice of 7r, the Langevin diffusion 0 becomes, 


dm = - 2 


1 ( 6{t) - Up 


at 


dt + dWt, 


and its numerical discretisation by the SGLD with step size h reads as follows, 


( 11 ) 


( 12 ) 


(13) 


9k+1 — (1 — Ah)e k + B k h + 


(14) 


where A/"(0,1) and 


1 ( — 

2 W + a x) 

N E"=i Xth 

k n 2cr x 


(15) 


where r k = (r k i, • • • , r kn ) denote a random subset of [N] = { 1 , • • • , N} generated, for exam¬ 
ple, by sampling with or without replacement from [N], independently for each k. We note 
that the updates (14) will be stable only if 0 < 1 — Ah < 1, that is, 0 < h < 1 /k0 In the 


following we will also consider parameterising the step size as h = r/A where 0 < r < 1. 


1 Note that the posterior variance is x 1/A, so that steps of size x 1/A are 1 /\J~A. smaller than the width 
of the posterior. However the injected noise has variance x 1/A which matches the posterior variance. 
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We denote B = (B k ) k > 0 . At the risk of obfuscating the notation, we will denote by 
Var(L>) the common variance of B k for all k. For sampling with replacement, we have 


Var (B) 


1 N 
icrff n 



1 

N 



2 


1 N(N - 1) 
4a£ n 


Var(A), 


where Var(X) is the typical unbiased empirical estimate of the variance of {X \,..., A/v}. 
For sampling without replacement we have, 


Var(£>) 


1 N ( N _ n) "( 

Aatn{N-l) 2L, { 1 


1 

N 



2 


1 N(N — 
4 aff. n 


—Var(A). 


(16) 


2.1 Analysis of the Asymptotic Bias 


We start by inspecting the estimate of the posterior mean. In particular, using Equation (14) 
and taking expectations with respect to we have 


E(O k+1 \B) = (1 - Ah)E(O k \B) + B k h 


(17) 


which can be solved in order to obtain 

M—l 

E(d M \B) = (1 - Ah) M E(6 0 ) + Ml - Ah) k B M - k - 1- 

fc =0 


If we now take the expectation with respect to the random subsets B k , using the fact that 
E(B k ) = E(B) and take the limit of M —> oo, we have 


E(0oo) = ^(1 - Ah) k hE{B) = h/{ 1 - (1 - Ah))E(B) 

k =0 


E(B) 

A 


Ef=i Xi 




+ N 


We thus see that the SGLD is capturing the correct limiting mean of the posterior indepen¬ 
dently of the choice of the step size h. In other words, for the test function (f)(6) = 6, the 
asymptotic bias is nil. 

We now investigate the behaviour of the limiting variance under the SGLD. Starting with 
the law of total variance, 


Var[4+i] = E(Var[0 fc+1 | B}) + Var(E [6 k+1 \ B}), 
a simple calculation now shows that 

Var[0 fc+ i | B] = (1 - Ah) 2 Var [9 k \ B\ + h 

and 

Var(E[0 fc+1 | B}) = (1 - Ah) 2 Var(E[9 k \ B}) + h 2 Var(B k ). 
Combining these two results, we see that 

Var(0fc + i) = (1 - Ah)' 2 Var (9 k ) + h + h 2 Vai(B k ). 
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If we now take the limit of k —> oo, we have that 


Var(6 , 00 ) 


1 hVar(B) 

2 A - A 2 h + 2A-A 2 h' 


(18) 


where Var(L>) is the common value of Var (B k ) for all k > 0. We note here that in the case of 
the Euler-Maruyama method (from here on we will simply refer to this as the Euler method) 
where n = N and Var(H) = 0, only the first term remains. In other words, the first term 
is an (over-)estimate of the posterior variance o 2 = 1/2 A obtained by the Euler-Maruyama 
discretisation at step size h. Our result here coincides with [Zygalakis, 2011| . On the other 
hand, the second term is an additional bias term due to the variability of the stochastic 
gradients. Further, using a Taylor expansion in h of the second summand, we see that the 
SGLD has an excess bias, relative to the Euler method, with first order term equal to 


Var (B) 
1 2 A 


(19) 


Using the fact that Var(0 oo ) = E[6^] — E)^] 2 , and that the asymptotic bias of estimating 
E[0] is nil in this simple model, we see that the above gives the asymptotic biases of the Euler 
method and SGLD in the case of the test function </>(#) = 6 2 . 

We now consider the modified SGLD given in Equation (|9|) and to be discussed in Section 


4.2 In this case the numerical discretisation of Equation (13) becomes 


0k +1 = 0k~ Ah9 k + B k h + Vh^l- ^ Var(H)^ £ k 


( 20 ) 


A similar calculation as for the SGLD shows that 


E(0oo) 


Var((9oo) 



1 h 2 Var 2 (L>) 

2 A - A 2 h + 4(2 A - A 2 h )' 


( 21 ) 


with the last term being the excess asymptotic bias. A Taylor expansion of the excess bias 
term shows that the term of order h vanishes and the leading term has order h 2 . Hence, for 
small h , the excess bias is negligible compared to the asymptotic bias of the Euler method, 
and we can say that, up to first order and in this simple example, the mSGLD has the same 
asymptotic bias as for the Euler method. In Section [4.2| we will show that these results hold 
more generally. 

It is useful to visualise the above analytic results for the asymptotic biases of the Euler 
method, SGLD and mSGLD. In Figure [l] we show this for a dataset of 1000 points drawn 
according to the model. The first observation is that the Euler method has lowest asymp¬ 
totic bias among all three methods (although of course it is also the most computationally 
expensive; see Section [6]). We observe that if we choose n = 10 points for each gradient 
evaluation, for large values of the step size h, the SGLD is superior to the mSGLD. However, 
as h is reduced, this is no longer the case. Furthermore, if we use a more accurate gradient 
estimation with n = 200 data points, we see that mSGLD outperforms SGLD for all the step 
sizes used, but more importantly its asymptotic bias is now directly comparable with the 
Euler method where all the data points are used for evaluating the gradient. 
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n=10 


Asymptotic bias 


n=200 


Asymptotic bias 



Figure 1: Comparison of the asymptotic biases for the SGLD, the mSGLD and the Euler 
method for the test function 4>(9) = 9 2 . For all simulations, we have used N = 10 3 . We used 
n = 10 on the LHS and n = 200 on the RHS. 


2.2 Finite Time Analysis 

In the previous subsection we analysed the behaviours of the three algorithms in terms of 
their biases in the asymptotic regime. In practice, we can only run our algorithms for a finite 
number of steps, say K. and it would be interesting to understand the behaviours of the 
algorithms in this scenario. With a finite number of samples, in addition to bias we also have 
to account for variance due to the Monte Carlo estimation. 

A sensible analysis accounting for both bias and variance is to study the behaviour of the 
mean squared error (MSE), say in the Monte Carlo estimation for the second moment, 


K -1 


MSE 2 := E 




( 22 ) 


fc =0 


We can expand the quadratic, and express MSE 2 as a linear combination of terms of the 
form E [9j\ for p = 1,2, 3,4. Each of terms can be calculated analytically, depending on the 
data set X, the total number of steps K, the subset size n, as well as the scaled step size 
parameter r = hA. We provide these calculations in Appendix |10.1| and a Mathematica® file 
in the supplementary materials. 

In Figure [2] we visualise the behaviour of the resulting MSE 2 for a fixed dataset with 
N = 1000 items, and with scaled step size r = 1/20. For the same number of steps M, the left 
figure shows that SGLD and mSGLD behaves similarly, decreasing initially then asymptoting 
at their asymptotic biases studied in the previous subsection. At r = 1/20 mSGLD has lower 
asymptotic biases than SGLD. Further, both MSE 2 ’s decrease with increasing subset size 
n, and are higher than that for the Euler method at n = 1000. Since SGLD and mSGLD 
computational costs per step are linear in n, the right figure instead plots the same MSE 2 ’s 
against the (effective) number of passes through the dataset, that is, number of steps times 
n/N. This quantity is now proportional to the computational budget. Now we see that 
smaller subset sizes produce initial gains, but asymptote at higher biases. 

These analytical results for a simple Gaussian model demonstrate the more general theory 
which forms the core contributions of this paper. Sections [3] and [4] develop a method to 
study the asymptotic bias as a Taylor expansion in h, while Section [5] provides a finite time 
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— SGLD — mSGLD 


— SGLD — mSGLD 




Figure 2: MSE 2 of the sample average for the SGLD and the mSGLD for the second moment 
of the posterior. 


analysis in terms of the mean squared error. Both analyses are based on the behaviour 
of the algorithms for small step sizes, and in this regime we see that mSGLD has better 
performance than SGLD. In Section [6] we will return to the simple Gaussian model to study 
the behaviour of the algorithms using different measures of performance and in different 
regimes. In particular, we will see that for larger step sizes SGLD has better performance 
than mSGLD. 


3 Review of Weak Order Results 


In this section we review some existing results regarding the ergodicity and accuracy of nu¬ 


merical approximations of SDEs. We start in Section 3.1 by introducing the framework and 
notation, the Fokker-Planck and backward Kolmogorov equations, and with some prelim¬ 
inary results on local weak errors of numerical one-step integrators. Section |3.2| presents 
assumptions necessary for ergodicity, and extends the results to a global error expansion of 
the weak error as well as the error in the approximation of the invariant measure. Finally, 


in Section T3 we apply our results to explicitly calculate the leading order error term of the 
numerical approximation of an Ornstein-Uhlenbeck solved by the Euler method. 


3.1 One-step Numerical Approximations of Langevin Diffusions 

Let us denote by p(y, t ) the probability density of 0(t) defined by the Langevin diffusion (JTj) 
with initial condition 0(0) = 0 and target density n (y). Then p(y,t ) is the solution of the 
Fokker-Planck equation, 


dp 

~di 


= £*p, 


(23) 


with initial condition p(y, 0) = 5(y — 6), a Dirac mass for the deterministic initial condition, 
and the operator C* given by 


Cp = -\vo- (VlogT t{6)p) + -Ve-p. 


(24) 
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This operator is the L 2 -adjoint of the generator of the Markov process (9(t))t> o given by 0, 

C = ^ Vo log 7 t(0) ■ Vg + ^ Ag, (25) 

Given a test function f>, define u(9, t ) to be the expectation, 

u(9,t)=E((f>(9(t))\9(0) = 9), (26) 

with respect to the diffusion at time t when started with initial condition 9(0 ) = 9. We note 
that u(9,t) is the solution of the backward Kolmogorov equation 

%=Cu, (27) 

u(9,0 ) = 4>(9). 


A formal Taylor series expansion for u in terms of the generator C was derived in 
Zygala kis, 201 1| and made rigorous by |Debussche and Faou, 2 012 
state space is 9 e T d . The Taylor series is of the following form, 


for the case where the 


u(9, h) = <t>(9) + J2 + h l+1 n(9 ), 

3 =1 J ' 


(28) 


for all positive integers l, with the remainder satisfying a bound of the form \ri(9)\ < q( 1 + 
\9\ Kl ) for some constants ci,m depending on 7r and cf>. 


Remark 3.1. Another way to turn u(9,h ) = (f)(9) + hC(f) + + ■■■ into a rigorous 

expansion, see Equation (28), is to follow the approach in JTalay and Tubaro, 199(\ Lemma 
2] and to assume that log7r is C°° with bounded derivatives of any order (and this is the 
approach we follow here). This fact, together with the assumption that 


\m\<c(i + \9n 


(29) 


for some positive integer s is enough to prove that the solution u of Equation (27) has deriva 


tives of any order that have a polynomial growth of the form of Equation (29), with other 
constants C,s that are independent of t E [0, T]. In turn, these regularity bounds establish 
that Equation (28) holds. We mention here that the regularity conditions where relaxed in 


recent work in \Kopec, 201 Iff for the elliptic case and in \Kopec, 2015 1/ for the hypoelliptic 


case. 


Now assume that one solves Equation 0 numerically with a one step integrator, which 
we shall denote by, 

9 n + 1 = l k( 9 n ,h,f, n ), (30) 

where 9q = 9(0), h denotes the step size, are iid M(0, 1), and 9 n denotes the numerical 
approximation of 9(nh) for each n £ N. For example in the case of the Euler method for 
equation (jTj) one has 

T(<9, h,£) = 9 + ^Vlog7r (9) + Vhf 
Now, using this formulation we can define 


U(9,h)=E(cj)(9 1 )\9o = 9), 


(31) 
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for the expectation of the test function after one step of the numerical integrator starting 
with the initial condition 6$ = 0. We will make the following (easily satisfied) regularity and 
consistency assumptions about the integrator: 

Assumption 3.1. We assume that the following hold: 


• V log 7r is C°° with bounded derivatives of all orders. 

• For all deterministic initial conditions 8 q, we have 

|E (01 -0 O )| < C(l + \6 0 \)h, and \0i - 9 0 \ < M(1+ \9 0 \)Vh, (32) 


where C is a constant independent of h, for h small enough and M is a random variable 
that has bounded moments of all orders independent of h and 8 q. 


Equation (31) has a weak Taylor series expansion of the form 


U(e, h) = (f>{9) + hA 0 {n)f(e) + h 2 A\{Tt)(j)(6) + • • • , 


(33) 


where Ajfn), z = 0,1,2 ,... are linear differential operators with coefficients depending 
smoothly on the drift function Vlog7r(0) and its derivatives (depending on the choice 
of the integrator). 

• Aq(k) coincides with the generator C, in other words, the numerical method has weak 
order at least one. 


Assumptions |3.1| immediately imply the existence of a rigorous expansion 


U(9 , h) = </>{9) + Y, h l+1 Mir)ct)(d) + h^R^O) 


(34) 


»=o 


for all positive integers l, with a remainder satisfying \Ri(9)\ < C/(l + |0| Ai ) for some constants 
Ci,K[. We say that the numerical solution has local weak order p if the first p terms in the 


expansion (33) of the numerical approximation agrees with that (28) for the exact diffusion. 


In this case, it is easy to see that the following local error formula holds, 

E(mh))\m = o)~ nmm = o) = h? +i - a^ m + o(^+ 2 ). (35) 


3.2 Global Weak Error Expansion 

In this subsection, we will extend the local weak error expansion to a global one. Specifically, 
after M steps of the numerical integrator with step size h, we are interested in the difference 
between 9m and the exact diffusion 9(T) where T = Mh , as evaluated by the difference 
between the corresponding expectations of (j>, 


E{f, h, T) = E(0(0(T))|0(O) = 9) - E(0(0 m )|0o = 0), (36) 

In order for this study to make sense (when considering the limit T —> oo), we will 
require that the SDE and its numerical approximation are both ergodic. We make standard 
assumptions in order for the Langevin diffusion (9(t))t> o as given by 0 to be ergodic (see 
Hasminskii, _19801): 
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Assumption 3.2. We assume that the following hold for the Langevin diffusion (6{t))t> o' 

• V log 7 r is C°° with bounded derivatives of all orders. 

• there exists (3 > 0 and a compact set K C such that V 6 e M d \A, 

< 0 ,Vlog 7 r( 0 )> <-/3||0|||. 


The question of the ergodicity of the numerical approximation (9 n ) is considerably more 
intricate in general. There exist cases where the underlying Langevin diffusion is ergodic, 
but its numerical approximation is not ergodic, or does not converge exponentially fast 
Roberts and Tweedie, 19961. This relates mainly to the properties of the drift coefficient 
and its behaviour at infinity. For the Euler-Maruyama and the Milstein scheme this has been 
investigated in |Talay and Tubaro, 19901. In what follows we will simply assume that the 
Markov chain ( 6 n ) defined by the numerical approximation is indeed ergodic. Under this as¬ 
sumption the following theorem, which combines results derived by (Talay and Tubaro, 1990 
and (Milstein, 1986), can be shown (see [Abdulle et al., 20l4l | for a proof): 


Theorem 3.2. Suppose that the state space is that Assumptions 3.1 and 3.2 hold, and 
that the Markov chain (0 n )n> o defined by the one step integrator (30) is ergodic. If the 
numerical approximation has local weak order p, that is, Equation (35) holds, then we have 
the following expansion of the global error (36), for all 4> £ Cp‘~ (MUM), 


h, T) 


where e{0, t ) is given by 


r T 

hP / E (e(0{s),s))ds + O(h p+1 ), 

Jo 


e(6,t) 


1 rp+ l _ a 

(p + i)r Ap 


v(0,t), 


(37) 


(38) 


with v(0,t) = K(cj)(0(T))\O(t) = 0) satisfying 


dv 

dt 

v(9,T) 


—Cv, 

m- 


(39) 


The expression (37) was proved by |Talay and Tub aro, 1990] for specific methods (e.g. the 
Euler-Maruyama or the Milstein methods), while the general procedure to infer the global 
weak order from the local weak order is due to | Milstein , 1986| (see also |Milstein and Tretyakov, 2004 
Chapter 2.2]). However, the formulation of the error function (38) here is in terms of the gen¬ 


erator C and the operators A* in Assumption 3.1 and does not contain any time derivatives 


as in (Talay and Tubaro, 1990, Milstein, 1986| . This formulation will be particularly useful 
for obtaining our main results. 


Using Theorem 3.2, one can obtain a similar expansion to that in Equation (37) for the 


difference between the true and the numerical ergodic averages: 


Theorem 3.3. Suppose that Assumption 3.2 holds, that our numerical method with deter¬ 
ministic initial condition is ergodic and of weak order p, and that </> : —>• M is a smooth 


function satisfying Equation (29). Then, 


1 K ~ 1 

l im -77 ^ 

K—t oo K L ' 

71—0 


<,Ky) 7r (y)dy = -\h v + o(h p+1 ) 


(40) 
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where X p is defined as 


Xp — 


0 + 1 )! 


C p+l - A p ) u(y, t)n(y)dt dy , 


(41) 


and u(y , t ) satisfies Equation 


The proof is given in |Abdulle et al., 2014] , and is similar to that in |Talay and Tubaro, 1990 


Theorem 4] with the main difference being that Equation (37) is used as the starting point, in¬ 
stead of the specific formula for the Euler-Maruyama method used in |Talay and Tubaro, 1990 


Theorem 3.3 provides an explicit expression for the leading order term of the asymptotic 
bias of the numerical method. It will thus be the key result in our analysis of the asymptotic 
behaviour of SGLD later. Intuitively Equation (41) says that if we want to calculate the 
error between the numerical and the true ergodic averages, we need to take into account the 
long time (t —> oo) discrepancy between the true and the numerical solution given by 


1 


-C p+l - A^j u(y,t), 


,(p + 1) ! 

and then average over all possible initial conditions y with respect to invariant measure n (y). 


3.3 An Illustrative Example 

We illustrate the weak order results above in the case of the Euler-Maruyama scheme applied 
to the Ornstein-Uhlenbeck process. For ease of notation, let f(x ) = 5Vlog7r(0). The Euler- 
Maruyama update steps are, 


9 n +l — On + hf(O n ) + y/h^ri 

A straightforward calculation |Zygalakis, 201 1| yields that the differential operator A± in 
is given by 

Aiif = ^/ T V 2 V’/ + ^ ^2 ^ 4 Hei,ei,ej,ej) 


(42) 


(43) 


i=l 


i,j =1 


where ei,...,ed denotes the canonical basis of and if'" (■,-,-) and are the 

derivatives of if, which are trilinear and quadrilinear forms, respectively. In dimension d = 1, 
it reduces to 

A\if = -fif" + V" + V 4) 


2 

/r, 2 


In the case where 0 £ M and tt( 0) = e 1,0 ^ , the Langevin diffusion 0 corresponds 

to the one dimensional Ornstein-Uhlenbeck process, 


d0(t) = -- 


1 f 6(f) — y 


(7 


dt + dW t 


(44) 


For the test function cf(9) = 0 2 , a simple calculation reveals that the solution of Equation 
(27} is 

u(0, t) = a 2 (l- e- t/ff2 ) + 0 2 e~ t/rT2 + ^(1 - e ~t/^ 2 ) e -t/^ 2 + £^(i - e ~ t/2a2 ) 2 . (45) 
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The Euler-Maruyama scheme has weak order p = 1, and (see jZygalakis, 2011 for more 
details), 

l-o . 1 d Id 2 

2 C ~ Al = ~ ^d6 ~ 4^d?0' 


Using this together with Equation (45) for u(9,t), we find 


1 \ p-t/v 2 e t / 2ff2 (p, — 6) ((1 — e */ 2<r2N )/r + e t / 2a2 o) 

2* - “<»■'> “ -2?-- ~ 4cr 4 - Z 


Formula (41) now gives 


roo r+oo f e -t/* 2 e- t / 2a \p-6)^(l-e- t / 2 ^ p + 

Ai = — / / | ———,,— —-- —- A - | - ,— - d6dt 


i o 


2d 2 


4 a 4 


a/Svt. 


TTtJ 


1 

4' 


(46) 


This is in agreement with known results on the stationary distribution of the Euler-Maruyama 
approximation to the Ornstein-Uhlenback process, see jZygalakis, 2011 : 


cr 2 2 1 


-K h ~ JV(/i, crft) where a 2 h = - - } -- = a 2 + -h + 0(h 2 ). 

1 - 1a~ 2 4 


4 Weak Convergence Analysis 


We study the weak convergence properties of the SGLD method in the light of Theorems 3.2 
and 3.3 The analysis in Section|4~l1 implies that at leading order there is a cost associated with 


not calculating the likelihood over all points. Thus, we introduce in Section 4.2 a modification 
of the original algorithm which has an error that is, asymptotically in h, identical to the error 
of the Euler method, when all data points are taken into account in the calculation of each 
likelihood gradient. 


4.1 Stochastic Gradient Langevin Dynamics 

Theorems |3 . 2| and |3 . 3| imply that in order to characterise the leading order error term both for 
the weak convergence and the invariant measure, we need to calculate the corresponding dif¬ 


ferential operators Aq, Ai, ■ ■ ■ in Equation (33). To simplify the presentation and to illustrate 


the main ideas, we present the calculations only in the case where 9(t) is one dimensional. 
We start our calculations by rewriting the SGLD method in the following form 


Oj+i — 9j + hfj(6j) + 


ji 


(47) 


where 

1 / N n \ 
m = - ( Vlog7ro(0) + -^Vlog7r(X Tj ,|0) 1 , 

tj is the subset (possibly with repetition) chosen at step j and, 

E T J j (0) = f(0)-.= ^Vlog7r(9), V n<N. 


(48) 
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Expanding 4>(0j+ 1 ) in powers of h and then taking expectations with respect to the injected 
random noise £j, 


E^(M-+i)l^) = Wi) + h ffiOfiofOfi + -<m-) 


+ t (fHWiOi)++ ^ ( 4 ) (^))+ o(h 3 ). 


If we now take expectations with respect to Tj, 

,2 


E(0(^ +1 )|^) = <K%)+hX</>(%)+V (E^/?^-))^) + + ^(0,) )+0(h 3 ) 


(49) 

where C is the generator (25) of Equation Q. We thus see that the SGLD method is a first 
order weak method and, dropping the indexing by j for notational convenience from now on, 

Ai( tt)^ = J (e r (/ 2 (0))0" + f(e)cj>W + V 4 ) 


The asymptotic bias in Equation (|41|) has an expansion based on the differential operator, 

1 




<P_ 
dd 2 


= + + (50) 

We thus see that in the case of SGLD the leading order error term contains an extra factor of 
— ^Var (/(0))Vg when compared to the Euler method (n = N), in which case Var (f(0)) = 0. 
This can be understood as the penalty associated with not using all the available points for 
calculating the likelihood at every time step. It results in an extra term in the corresponding 
error expressions given in Theorems |3.2| and |3.3| when compared with the Euler method. More 
precisely, for n <C N the term — 2Var(/(0)) is of size 0{N 2 ) thus making the leading order 
error term 0(hN 2 ) in Equation (37). 

Example 4.1. We illustrate the above findings on the toy model discussed in Section [S| In 
particular, using the expression for u(0,t) from Section \3H \ (replacing pL and cr by p p and a p 
respectively), and that Var(/($)) = Var(L?) for this simple model, the extra term in Equation 
when compared with the Euler method is now given by 

— ^ Var (B)dgu(9,t) = — Var(L>)e~ f /°p. 


A simple integration of this term according to the formula (41) gives that the overall contri¬ 
bution of the extra term, which is, 


a p Var (B) 


Var (B) 
2 A 


and thus agreeing with Equation 
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4.2 Modified SGLD 


As we have seen in the previous section, the SGLD method introduces an extra term — ^Var(/(#)) Vg 
in the leading order error term related to the weak error (Theorem 3.2) and to the ergodic 
averages (Theorem 3.3). When n -C N, this term is of order 0{hN 2 ). In this section we 
will explore a modification of SGLD (mSGLD) for which this term is removed, so that the 
leading order term is exactly the same as for the Euler-Maruyama scheme. Specifically, the 
mSGLD updates are, 


0j+1 = Oj + hf (Oj) + Vh(l- ^Var 


(51) 


We can again derive the weak order expansion as in the previous subsection. Our first step 
is to expand (f){9j + i ) in powers of h and then take expectations with respect to the random 
variable £j. In particular, we obtain 

%(M+r)) =M) + h (fjiPM'Wj) + \fWi) 


+ 


h 2 


f-(0j) ~ Var f(0j)\ </>"(%) + /,(%)0 (3) (^) + 1^(0,) ) + 0(h 3 


Taking expectations with respect to the random sampling and using Equation (48), we obtain 

(52) 


E (<K0 j+1 )) =</»(%) + h/Wdj) 
h' 2 


1 


E T AfJm - Var f(0j) </>"&) + + 0(h 3 ), 


+ y 

where C is the generator of Equation We thus see that the mSGLD is a first order weak 
method and 


1 


1 


Ai{n)ct> = - E r (f 2 (d)) - Var/>) 0" + f(0)</>& + 


Using the expression for C 2 as in the case of SGLD, we have that, 

~ A 1 = \ (/W/'W + % + \ (/'W + f 2 (0) - Mf 2 m + Var m) §2 




(53) 


We see that the leading order term in the weak error and the error for the ergodic averages 
is the same as for the Euler method, which uses all data at every step. In higher dimensions, 
a similar calculation gives the mSGLD updates, 


0j+i = 0, + WM) + Vh(j- ^CovM-)) 0 


( 54 ) 


where 


Covf(0) = E 


and is a d- dimensional standard normal random variable. 
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Remark 4.2. Except for special cases, Var f(Of) does not have a closed form. The simplest 
possible way to proceed without it is to replace it by an unbiased estimator, for example in 
case of sampling without replacement, 


Var fj{9) := 


N(N - n ) 
n(n — 1) z 


£ Vlog7r(a Tji | 9 ) - 


i =1 


m 

N 


This replacement does not change Equation (53) because the smallest order contribution to 
Equation (49) is of the form 


-h 2 K 


Var f{6j)$ = —h 2 Var f{9f). 


However, estimating the variance of the stochastic gradient will affect higher order terms in 
h. For fixed h these terms may have larger contribution to the overall error depending on the 
choice of n and N. In fact, this is true even if we use the exact variance for the toy model in 
Section 2.1 . More precisely, we compare the bias of the mSGLD and the SGLD in Equation 
(70) notice that h 2 term might be larger depending on the choice of n and N. 


5 Finite Time Sample Averages 


Having focused on the SGLD in the asymptotic regime, we will now provide non-asymptotic 
analysis of the mean squared error (MSE) of the finite time sample averages of the SGLD. 
In particular, we will decompose the MSE into bias and variance. The main result of this 
section will be of the form 

K—l 


Bias: 


MSE: E — 
\ K 


E-L ^2 - f <j)(x)ir(x)dx 

i =0 J 

1 K ~ 1 r V 

- ^2 - / 4>(x)n( x )dx J = O 

t=0 ' J 


h+ — 
Kh 


(55) 


= Q h z + — 


1 

Kh 


Remark 5.1. In \Teh et al., 2 01$, a central limit theorem was provided for the decreasing 
step size SGLD which shows a convergence rate of 0(K~ s). At first sight, the bound in 
Equation { 55) seems better because of the -A term in the upper bound. However, due to the 


bias, an additional term of order 0(h 2 ) appears. In order to compare (55) with the previous 


result of [Teh et al., 201$, we optimise the sum of both terms over the step size h. This results 

2 

in a bound on the MSE of the SGLD of order 0{K~i) and agrees with the rate achieved by 
the decreasing step size SGLD. This agreement between decreasing step size discretisation 
and fixed step size discretisation is, to our knowledge, not a widely-known observation in the 
literature. In contrast, for standard MCMC algorithms the MSE is bounded by order 0(iL _1 ) 
due to the Metropolis-Hastings correction that removes the bias. Nevertheless, experimental 
results in the literature demonstrate that the SGLD might be advantageous in the initial 
transient phase of learning, see e.g. \Patterson and Teh, 2013[ Chen et al., 201 $ 


In Section 5.2 we will focus on establishing the bound in Equation (55) which is an 


extension of the work by |Mattingly et al., 20101. The authors obtained similar results for 
finite time sample averages of discretisations of diffusions of the form 


d0 t = f(e t ) + g(6 t )dW t 
on the torus which we review subsequently in Section [5. 1| 


(56) 
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5.1 Preliminaries on the Poisson Equation and Time Averages 

In the following a connection between time averages of the diffusion and the corresponding 
Poisson equation will be presented. For a more elaborate description of this technique we 
point the reader to Section 4.2 of Mattingly et ah, 2010 and references therein. 

The Poisson equation is an elliptic PDE on the basis of the generator associated with 
Equation (56). The generator of Equation (56) is 

£0 = ■ V/ + ^g(9) T V 2 ipg(6), 


while the Poisson equation is given by 

Ci/i = (/> — (/) onl d 


(57) 


where 0 is a test function and (j) := f (p/x/n/dx) with n being the invariant distribution of 
(56). For applications in Bayesian statistics n represents the posterior and the quantity <j> 
the posterior expectation of interest. The posterior expectation cj) is estimated by the time 
average ^ cj)(9(s))ds of the Langevin dynamics. The difference between the two can be 

expressed explicitly by using Ito’s formula on the solution ip of the Poisson equation 


ip(6(t))-i/}(6(0)) = / cj)(e(s))-(j)ds+ S/t/} (0(s)) ■ g(9(s))dW S: 


\\ ms))ds-4> = o)))-^ 


(9(s)) ■ g(9(s))dW s . 


If the first term and the variance of the second term (the martingale term) on the right hand 
side can be bounded, an error bound for the time average is obtained. 

In this article, we are interested in the time average of the Euler discretisation and the 
SGLD. We can build on the ideas of Section 5 in [Mattingly et ah, 2010 which considers time 
discretisations of Equation (56) of the following form 

Ok+I = 0k + hf{9 k , h) + Vhg(9 k , h)rj k , g k ~ M ( 0 , 1). 

In |Mattingly_ et al., 20101 a Taylor expansion is used to express 

Ai/}(9 k+ i) := ij}(9 k+ 1 ) - ij}(6 k ) = h {A 0 i/}) (6 k ) + R k 


where R k is the remainder term. The term To was introduced in Equation (33) in Section 

IQ 

Using that Ci/} = (/> — (/), summing over k and dividing by hK yields 


1 K ~ 1 
k =0 


K-l K-l 

= ^ + ~Kh ^ ^ K ) “ ^ ( 0 °)) “ hK S h ( A ° ~ 0 k ) - Rk ' 

k =0 1 k=0 


Controlling Aq—C and the remainder gives rise to Theorem 5.1 and 5.2 in |Mattingly et al., 2010 
stating that 


E(/>k - (/> 


<C[h + 


1 


h ■ K 


E 


K 


< C [h 1 + 


1 


h ■ K 


(58) 
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In particular, these results were derived for discretisations of SDEs on the torus. This 
simplifies the presentation because the derivates of if are bounded on a compact set. However, 
the same arguments hold if the following assumption is imposed instead 


supE 

k 


^ (o k ) 


< oo for i = 1,..., 4. 


(59) 


verifying this condition will allow us to work on M d . 


5.2 The Bias and the MSE of Finite Time SGLD Averages 

We consider the SDE 

dO t = f(0 t )dt + g(0 t )dW t . (60) 

with g = I being the identity matrix but we keep g in order to make the presentation clearer. 
Based on this setup the recursion of the corresponding SGLD reads as follows 

Afc+i i 0 k /fch T h.2 

where fk is an unbiased estimate of /. The focus of this section is to establish results similar 


to Equation (58) for the SGLD. They will be formulated in Theorem 5.2 


For the readability of the subsequent calculation we use the following notations 

Afc+i = 0 k+ 1 - 9k, 4>k = 4> (Ok ), 

fk = f(0k, T k , h) for the estimate of the drift, g k = g(0 k , h) = I,ip k = if (9 k), V k = V (0 k ) and 


in Section 


3.1 


satisfies 


D k if k = D k if(9 k ). The term To, as introduced in in Equation 
Aq = C but we keep To for clarity. Thus, we have 

Aoipk = • E T f(0k, t, h) + ^S (•, h) : V 2 if(9 k ) 

where S(x, h) = g(x, h)g(x , h) T = I. 

We use the following third order Taylor expansion on 4’(0k+ 1 ) ~ if(0k) in order to obtain 
a bound on fraclK (cj) k — f>) 

1 1 

i>k+1 = V’fc + VV’fc • Afc + i + -A^ +1 V 2 V , fcAfc + i + -ip k+ i^ (Afc + i, Afc + i, Afc + i) + R k+ i 

Rk +1 =1 [ s 3 ^ (4) (s0 k + (1 - s)0 k+ 1 ) (A fc+ i, A fc+ i, A fc+ i, A fc+ i) ds. 

J o 

Here and are the third and fourth order derivative in the form of a trilinear and a 
quadrilinear form, respectively. In this setting, a third order expansion is required in order 
to obtain the h 2 term in the h 2 + bound in the MSE (see Equation (58) or Theorem 5.2). 
More precisely, the remainder of this expansion is forth order which together with the term 
\/Tif rn in Equation ([5]) contributes to the h 2 error term. In order to make the connection to 
the Poisson equation, we write the expansion above in terms of Tq. This yields 


/ 


V’fc +1 = ifk + hA 0 tp k + h* Vi> k ■ (g k £k+ i) + hVV’fc 


A - E t /(0*, t, h) + h~2 (gkHk+if V 2 Ykfk 


\ 


H h 


+ h 2 ^/ 2 ifkfk + gVA 3 ) (Afe+i, Afc + i, Afc_(_i) + r k+ 1 + Rk+i 
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where r k +1 = h\ [{gk^k+if V 2 ^fe (fffc6+i) - S(x,h)^j. 

Notice that M E^o' hA o^k = k J2k=o {^k ~ 4>) is the error of interest. In order to 
control this error, we sum the expression for ifk+i f° r k = 0,..., K — 1 and divide by T = hK. 
Grouping the terms for subsequent inspection gives 


tpK - y>o 

Kh 


K -1 


K -1 


K 


22 {4>k - 4>) + 22 (C - A)) k 


k =0 


k =0 


K -1 


1 K ~ l 1 i K ~ l i 3 - ^ 

- ^ r k+ 1 +- /G \7ip k (gktk+i) +7f, hi 22 fk^ 2 ^k (gkCk+i) 


k =0 


k =o 


k =0 


1 

+- 


M\,k 

K—l 


S 'S-. 




m. 


M 3i k 


K—l 


h 22 Wk ■ (A - Er/(0 fc , r, h)) + ~ ]T h2 fkV 2 Mk 


k =0 


fc =0 


a:-1 


K -1 


Sl.K 


+ ^ X] ^+i+^ g ^ fc(3) (^fc+i, A fc+ i, A fe+ i) 


fc =0 


(61) 




fc =0 


S 3 ,K 


S 3 ,K 


where the Mj k indicate the martingale terms and the other remainder terms. We split 

53,K = Mq,k + Mo,A' + So,K + ‘S'o,^ 

in terms of 

1 3 K ~ l 

m 0 ,k = ^hi 22 (W 3) ((9kVk+i), (gkVk+i), (gkVk+i)j) 

fc =0 
1 K ~ 1 

Mo,k = - 22 ^A (3) (fk,fk,gkVk+i) 

* k =0 

1 K_1 

So, a: = g ^ ^ 23 4 (3) (fti+idii+i./t) 

° fc =0 

a:- 1 

5*0,A' = - 22 ^ (fkifk,fk) ■ 

° fc =0 


Rearranging Equation (61) for ^(Lq 1 (A — 4>) and controlling the resulting right hand 
side of Equation (61) gives rise to the following theorem. 


Theorem 5.2. Suppose that there exists a function V such that the following three assump¬ 
tions hold: 


1. There are 1 ,.. -P^a £ (0,oo) such that the derivatives of the solution if to the Poisson 

equation satisfy the following bound 


ifW 


<V p ^ k , fork = 0,...,4. 


(62) 
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( 63 ) 


2. The drift f and the error from the estimate H := f(9,r) — f(9) satisfy 

E T H{9,r) 2p <V{9f Vp<p * 

"2 < y 


for p * = max {2 p^ t2 + 2, 2 p^^ + 4, 2 p^^ + 1, 2jy , ; 3 + 3}. Moreover, we suppose that the 
IE V p {9 k ) is bounded from above and that this bound is independent ofk, that is 

(64) 


supEIL p (9k) < 00 , Vp<p*. 
k 


3. V satisfies 


sup V (s9i + (1 - s)9 2 y < V[6{f + V(0 2 y, for all 9 1 ,9 2 ,p < p*. 

S 

Under these assumptions there exists ho > 0 and constant C such that for all h < ho 

E(j) K ~ 4> 


Bias 1 cP k 

E[6 k - 


iC[h + M> 


2 £ cfh2+ A 


(65) 

( 66 ) 

(67) 


where 


4>k = "p ^( 0fc ) and ^ 


K-l 


I< 


= E„ 


k =0 


Proof. For each term we bound the term inside the sum by a power of V p and then obtain 
an overall bound using sup, : EV) P < 00 . For example, ^ES\ t K can be bounded as follows 

K-l 


^ Si,K < 


2-trP*l>,2 


k IE Tfc 


k =0 
K-l 


h 


< \Y h2 supE^’ 2+1 < i h 2 K < h. 

k =0 * 


The details of this computation are contained in Appendix [9j 


□ 


Theorem 5.2 and the results for the decreasing step size SGLD |Teh et al., 20141 hold 
under assumptions formulated in terms of the solution if of the Poisson equation. More 
precisely, the crucial step is to establish a bound of the form 


supE 

k 


ipk+ i W (Ok) < 00 for k = 1 ,..., 4. 


This bound is established using Equations (|62|) and 

(9 k 


supE 

k 


< supE ||G|| P ^ < 00 for i = 1, 
k 


Thus, we are left with finding an appropriate Lyapunov function V such that Equations (62) 


and (64) hold. In Appendix 9.1, we formulate strong sufficient conditions on ir that ensure 


that these assumptions are satisfied and that Theorem 5.2 is applicable. 
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6 An Analytic Investigation of the Toy Model 


We now extend our analysis of the one-dimensional Gaussian toy model introduced in Section 
[2] beyond the general results of the previous two sections. More precisely, in Section 6.1 


we 


compare the Euler method, the SGLD and the rnSGLD by comparing the computational cost 
for fixed level accuracy specified in terms of the mean square error in estimating the second 
moment (MSE 2 ), optimising over the step size h, the subsample size n and the number of 
steps M. A numerical solution to the resulting optimisation problem demonstrates that the 
SGLD is advantageous in the lower accuracy regime while it degenerates to n = N in the high 
accuracy regime. On the other hand the rnSGLD does not degenerate and seems to maintain 
a constant speed up compared to the Euler method. In Section |6.2| we then consider the 
MSE 2 and use an analytic expression to study the behaviour of these algorithms for growing 
N. This allows us to extend the analysis of Sections [ 2 ] and [4] (in which we only consider the 
case limit h —> 0) and study the asymptotic bias of the SGLD and the rnSGLD by scaling 
both n and h in N. 


In Section 6.3 we finally adopt a different viewpoint by considering a fixed value of our 
parameter 9, denoted by while we take expectations with respect to the realisation of 
the data {W}. This enables us to study how Ey(MSE 2 ) behaves in the limit of N —> 00 . 
In particular, we find that for the case of the SGLD, the computational cost in order for 
Ex(MSE 2 ) —> 0 is reduced by a factor of N when compared to the Euler method. A similar 
analysis for the expected relative error in estimating the posterior variance (ERE) 


ERE = E 


J_ y^A — 1 n2 _[ J_ y'A- 
K Z- n=0 u i \ K Xi= 


RE' 


^K-l 
=0 


- 1. 


< 7 .„ 


( 68 ) 


reveals that under the constraint Ey(ERE) —> 0 the Euler method and the SGLD have the 
same computational cost on the algebraic scale in N. 


6.1 Minimising Computational Effort for Constrained MSE 2 

In Section [272] we compared the Euler method, the SGLD and the rnSGLD for the same choice 
of r = y. In the following we numerically minimise the computational effort with respect to 
the condition MSE 2 < e 2 . We assume that the computational cost is proportional to M ■ n 
which leads to the problem of solving 

min M ■ n (69) 

F 

subject to MSE 2 (r, M, n)<e 2 
w.r.t. r < 1, M, n. 

Even though we have analytic expressions for the MSE 2 , the solution to the optimisation 
problem does not have a closed form. To conclude our analysis, we illustrate the numerical 
solution to this problem for N = 1000 for the Euler method, the SGLD and the rnSGLD. 
The results, depicted in Figure [3j can be summarised as follows: 

1. as e becomes smaller, the gain of the SGLD over the Euler method in terms of compu¬ 
tational effort decreases (due to the fact that n increases); 

2. as e becomes smaller, the rnSGLD gains efficiency over the SGLD (the reason being 
that n seems to asymptote as e decreases). 
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(a) Minimised computational cost for the Euler, (b) Subset size n for minimised computational 
the SGLD and the mSGLD algorithm cost 



5.X10"' 5.x 10 5. *10 ■’ 5.* 10 

SGLD — mSGLD Euler 

(c) Step size r = tj- for minimised computa¬ 


tional cost 

Figure 3: Minimisition of computational cost oc M ■ n subject to MSE < e 2 
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M 



Figure 4: Scaling of r(e) and M(e) for minimal computational cost subject to the 

MSE 2 (r(e),M(e)) < e 2 


The upper bound obtained in Equation (67) suggests a scaling of M 


~ e 


-3 


and r ~ e to 


obtain an MSE of order e 2 with minimal computational effort. The numerical minimisation 
of M with respect to r and M subject to the condition MSE(r, M ) < e 2 confirms this scaling 
empirically, see Figure [4j 


6.2 The MSE 2 for fixed and increasing N 

We now consider the behaviour for growing data size N where a new data set X is generated 
in each instance. Figure [ 5 ] depicts the MSE 2 for N = 10* with i = 1,... ,4 for the subset 
choices n = IV 0 ' 1 , iV 0 " 5 and A r °' 9 ; each compared to the Euler method corresponding to n = N. 
In this plot we notice that the SGLD outperforms the mSGLD for n = A^ 01 and n = N 0 - 5 . 
The behaviour in Figure [5] suggests that the mSGLD has a larger bias than the SGLD which 
seems to contradict the findings of Section [2] and |4j Previously, we have just considered the 
asymptotic of h — > 0. In contrast, we scale both h = and n = N p in terms of N 

in Figure [5j In the following we investigate this relationship further by using the explicit 
formula for the asymptotic bias which has been made available in Section [2] . 

For simplicity we consider sampling without replacement, using the expression in Equation 
(16). Similar conclusions hold for sampling with replacement. First we consider the mSGLD. 
From Equation (21) and using the parameterisation h = r/A of the step size, the excess 


asymptotic bias becomes 


h z 


Var {B) 2 _ (r/T) 2 (^) 2 IV 2 Var(X) s 


4(2 A - A 2 h ) 


4(2 - r)A 


Because A ~ N, we see that the excess bias stays bounded for large N if and only if n > 
N 2 . In contrast, the same consideration for the SGLD shows that the excess bias due to 
subsampling vanishes so long as n —> 00 when N —> 00 . 

We can also identify the regime in which the mSGLD has a smaller asymptotic bias than 
the SGLD. From Equations (18) and © we see that this is the case when 


h 2 Var 2 (L?) hXar(B) 


4(2 A-A 2 h) ~ 2 A- A 2 h' 


(70) 
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SGLD 


mSGLD 


SGLD 



KT 2 0.1 1 10 100 


Iterations through the data set 

— n=N — n=N A 0.1 




0.01 0.10 1 

Iterations through the date 

— n=N — n=N A 0.! 


(a) The SGLD for n = N^o 


mSGLD 



(b) The mSGLD for n = N^o 


SGLD 



(c) The SGLD for n = 


(d) The mSGLD for n = N? 


(e) The SGLD for n = N A 


mSGLD 



(f) The mSGLD for n = N& 

- N=100 - N=1000 

- N=10000 . N=100000 


Figure 5: MSE 2 of the time average for the SGLD and the mSGLD for the second moment of 
the posterior as N —> 00 . Notice that Figures (c) and (d) and (e) and (f) have the same scaling 
respectively. Moreover, figures (a) and (b) have separate scaling because of the instability of 
the mSGLD. 
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Using Equation (16), the above can be rearranged to 


i > 


4 16 cr? 


N 


Let c = 


Equation (15), we get 


N 


be the relative size of the subsampling. Using h 


c > 


2r IVVar X 


16cj2 ( \ + A ) + 2rA r Var A' 
\ a e J 

which in the limit of large data sets N —>• oo yields 


r/A where A is given in 


^ 2rVarA' ^ n 
° ~ 16 + 2rVar X > ' 

In conclusion, for a fixed step size given by r/A , the mSGLD has a smaller bias than the 
SGLD if the above holds. In other words, the subsampling size has to be linear in the size of 
the data set for a fixed choice of r. 


6.3 Limit of the MSE and ERE for well-specified Data as N —* oo 

In order to investigate the limit of N —> oo, we need to specify the behaviour of the data 
as well. We study this in the well-specified case, in other words, we assume that the data 
is generated by the model for 0^ = 1. Previously, we have obtained an analytic expression 
for the expectation of the MSE 2 with respect to the realisation of the noise driving the 
algorithm. In contrast, we take expectations with respect to the realisation of the data X 
and the noise driving the algorithm in the following results. These results are formulated in 
analytic expression^ for the MSE 2 and the ERE depending only on M, n, r and N. We then 
choose M, n and r as functions of N and study the limit N —> 00 and how this affects the 
computational cost and the behaviour of the ERE and the MSE 2 as N —> 00 for the different 
algorithms. 

For the Euler method (n = N) we need to take h < ^ x in order to make Equation 
© stable. Moreover, we need the number of steps M to be of order N to approximate the 
diffusion to a time of order 0(1). Because we evaluate N data points per step, this heuristic 
argument suggests that the complexity is of order O (IV 2 ). Furthermore, we verify (using 
Mathematica®), that for the Euler method (n = N) 

lim ExMSE = 0, lim Ex ERE = 0 (71) 

JV—>-oo AT—>-oo 

for the choices M = N 1+2e and r = N~ € for any e > 0. The computational cost for fixed N 
is M ■ n = N 2+2e . Thus, this confirms the heuristics we used for the Euler method in Section 


A natural next question to ask in terms of the SGLD is if one can have Equation © to 
hold but for smaller computational complexity than the Euler method. Using Mathematica®, 
we obtain the following theorem for the MSE 


Theorem 6.1. For any e > 0 and the choices h = N 1 e , M = N 1+2e and n 
satisfies 


lim Efl.. x 
JV-s-oo 


1 

M 


M -1 


—(pi+ a i)) — ° 


k =0 


1, the SGLD 


see Appendix 10.2 for a sketch of the derivation for the MSE 2 (the derivation for ERE is similar) 
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SGLD 



N=100- N=1000 

N=10000 . N=100000 


Figure 6 : Expected relative error of the estimate of the variance of the posterior based on 
the SGLD. 

This constitutes a substantial gain compared to the Euler method because it reduces the 
computational complexity in the data size N from being almost quadratic to almost linear. 

We now draw our attention to the expected relative error in estimating the posterior 
variance, abbreviated by 



Because the posterior variance goes to zero as N =>- oo, it is conceivable that it requires 
more computational effort to ensure that limjv^-oo Ex ERE = 0. In order to illustrate the 
behaviour of the ERE, we first consider the behaviour for a fixed data set and repeat the 
experiment of Figure [5] in Figure [ 6 j The latter demonstrates that the asymptotic bias for the 
choice n = Nz (the asymptotes for the grey lines) have an increasing value in N. We used 
h = 2(^4 ~ x which decreases with N. However, we show below that is requirement cancels 
exactly the gain from n -C N at least on the algebraic scale in N. 

In particular, we now choose r = N~ a , n = and M = iV 7 . Hence the computational 
cost is . The step size h = satisfies h ~ N~ l ~ a because A ~ N . Since the algorithm 
performs M = N' 1 steps, we expect it to approximate the diffusion on the time interval 
h-M = ]V -1 ~ Q+7 . Therefore it is reasonable to require that 7 > 1 + a. Under this assumption 
and with the help of Mathematica®, we reduced the limit above to 


lim Ex ERE 

AT—>-oo 


lim 

TV—>-oo 

( 2 • N~ a -P +3 

]y-2a-P+3 

\(N + l) 2 (N~ a - 2) 2 

(N + l) 2 (N~ a ■ 

1° 

if a + /3 > 1 


1 00 

if a + /3 < 1. 



Thus for limx->.oo ExERE = 0 it is necessary that a + /3 > 1. This condition in turn implies 
that the computational complexity satisfies 


N^x 

steps 



cost per step 


j\ri+« W 3 


N l+a+P > 
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Thus, there is no computational gain for the ERE in the limit N —> oo on the algebraic scale 
in N. We note that picking 0* = ^ instead of 6* = 1 does not change the results. Thus, a 
closer initialisation does not change the result for the ERE. 

7 Logistic Regression 

In the following we present numerical simulations for a Bayesian logistic regression model. 
The data items are given by covariates Xi £ that are labeled by y* £ {—1, l}.We assume 
the data y* £ {—1,1} is modelled by 

p(Vi\xi,P) = viViftxi) (73) 

where cr(z) = 1 +CX p(_-j £ [0,1]. The model posses the assumption that y* depends on Xi 
through the linear relationship [3 t Xi. Nevertheless, logistic regression is commonly used after 
a preprocessing has taken place and is therefore used here for numerical illustration. 

We put a Gaussian prior A7(0, Co) on /3, for simplicity we use Co = I subsequently. By 
Bayes’ rule the posterior ir satisfies 


tt(/3) oc exp ||/3||jO n^*i). 

' ' i= 1 

We consider d = 3 and N = 1000 data points and choose the covariate to be 


Xl,l 

Xl,2 

1 \ 

X2,l 

X2,2 

1 

2h 000,1 

^1000,2 

1 J 


for a fixed sample of Xij ~' A f (0,1) for i = 1,... 1000 and j = 1, 2. We use a long run 
of the Random-Walk-Metropolis algorithm to estimate the posterior mean. 

On that basis we estimate the MSE of the SGLD based mean estimate using 100 runs of 
the algorithm with step size h = 0.002 for various subset sizes. Figure [7] depicts the MSE 
as function of the iterations and effective iterations through the data set. Notice that for 
this example the variance of the stochastic gradient f{6) depends on both 6 and all the data 
items. For this reason we replace Var f(6) by an estimate Var f(9), see also Remark 


4.2 


We 

note that the mSGLD is superior for n = 150, inferior for n = 50 and for n = 10 the MSE of 
the mSGLD does not drop below 1. 


8 Conclusion 

This article presents the mathematical foundations that are necessary for posterior sampling 
for stochastic gradient methods with fixed step sizes. We derived an error expansion of 
the asymptotic bias in terms of powers of the step size and identified how the constant in 
the leading order term depends on the unbiased estimator of the gradient. We construct a 
modified SGLD to match the Euler method in this asymptotic expansion. These asymptotic 
results are complemented by upper bounds on the bias and the MSE over a finite time 
horizon. Minimising the MSE with respect to the step size yields a decay of the error at the 
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- SGLD - mSGLD 


- SGLD - mSGLD 



Figure 7: Expected MSE of time average for the SGLD and the mSGLD for the mean of the 
posterior 


same rate as the decreasing step size SGLD, see Remark 5.1 These theoretical findings are 


completed with extensive analytic investigations of a one dimensional toy model that allows 
the derivation of analytic expressions for the sample average and its moments. Finally, this 
yields an exact quantification of expected errors. The results of this investigation can be 
summarised as follows: 


• In the high accuracy regime the SGLD deteriorates to the Euler method while the 
mSGLD prevails. 

• For small data batches the bias of the mSGLD is larger than for the SGLD. 

• In the limit as the number of data items goes to infinity the SGLD reduces computa¬ 
tional complexity of estimating the second moment with vanishing MSE by one power 
of the number of data items. 


This recommends the construction of new and a study of existing modifications of the SGLD 
such as the Stochastic Gradient Hamiltonian Monte Carlo |Chen et al., 2014 and the Stochas¬ 
tic Gradient Thermostat Monte Carlo [Ding et ah, 2014] algorithms. 
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9 Appendix A: Finite time Ergodic Average based Poisson 
Equation 


Proof of Theorem 5.2. Rearranging Equation (61) for j^J2k=o(0k ~ 0)-, the bias and the 
MSE can be controlled as follows: 


E Yh SUpE ^’° ~ f' 

Because Aq = C for the SGLD it is left to bound to bound E^S) K for i = 0... 4. First we 


consider i = 1 and use Equation (A. 6 ) 


K -1 


^ Si, K < 


'Th 


k =0 
K -1 


fk 


^£> 2 supEV** 


< 

~ T — 

k=0 


2+1 < - h 2 K < h. 

rji 


This procedure will be used over and over again. It can be summarised as follows: 
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1. bounding the terms in the sum by a power of V p , using Equation (A.6) and the as¬ 
sumption on derivates of ip', 

2. then derive the bound using supjEV^ < oo. 


For i = 2 we additionally use that Equation (65) implies 
r i 


'0 


sV 4) (■sOk + (1 - s)0 k+ 1 ) (A k+1 ,A k+1 ,A k+1 ,A k+1 ) ds < (V^ A + V^f) ||A fc | 


which allows us to follow the general procedure 

K -1 

be-i-i 


1 


— < — 


< _ 

rv rj ~i 


< _ 

rv rj~\ 


-E ^ A* 

fc =0 

^ , £ 1 k 2 (r «- 4 + F« i 4 )E T ||A i+1 || 4 

fc =0 

1 1 
-E £ / l 2 su P EE^ +2 (0 l ) < -h 2 < h. 


k=0 


We apply the general procedure to ESq k 


^S 0 ,K 


K -1 


< 


< 

r^i 


-E^h 2 E^' 3 E r ||^ + i| 

fc=o 
if—l 

E ^ /r 2 supEE^’ 3+ 5(0i 


1 

T 


fk 


k=0 


< -h 2 < h. 

rj~) r^j 

, if-i 


E5 0l * < -E^/r 3 E^’ 3 E T 


< 

r^i 


T 

1 


k=0 

K-l 


fk 


Y h 3 supEE^- 3 +2(^) < -AA 3 < h 2 . 


k=0 


Thus, we have established the bound on the bias given by Equation (A.3). 


In order to establish the bound one the MSE in Equation (67) we note that Equation 


(61) yields 


1 


if-i 


El 

, k=0 / 


< E 


{'ipK - ipy 

rp2 
2 


+^E ES h' + ^E EM i) 


i =0 


rp2 / ; 

4 = 0 


First we note that 


E 0/hf—< J_ EE 2 ^’ 0 < — 

j^2 ~ J^2 J 12 • 
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The Sf K terms can be bound in similar way as above with the additional use of Cauchy- 
Schwartz inequality to break the correlation between Vf and Vj. 
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Similarly, we bound 
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lar bounds can also be obtained for 
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For Martingale terms the cross terms vanish which allows us to obtain the following bounds 

1 1 1 2 

TTTi 7i t2 \ A / nr'/ f La a m.b b k.l k,m\\ 

EM 1J< = { ED lm^‘ {.9i Vi+i9i Vi +1 - 9i 9i )) 
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j~i2 


1 

d 2 5^ supEvf 


The following term is the crucial Martingale term as it yields the O (y) contribution 


f2 EM 2J< ~ ^2 


K -1 
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< - 4 /i y ef 2p,m < - 

~ 7^2 / V k rp 


Similarly, we estimate 
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The terms ^EA/q K and ^EAfg K can be bounded in the same way 


i ro Tr 2 


1 EAf 2 < h3 V EF 2 ^’ 3 < h ' iK < h2 


1 h/ 

ri^.K S ^E EV « 


^ VmffuW <r AA , '' 


^2 

fc =0 


< - < — 

rsj rp2 — jp 


The additional part for the SGLD is the term corresponding to the Martingale 


1 1 A ' _1 
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k =0 

For all these calculations to go through need sup^ EVf to be bounded. Collecting the orders 
present, we see that 


p* = max { 2^2 + 2, 2 p^ A + 4, 2 p^ A + 1 , 2 p ^, i3 + 3} 


is sufficient. 
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9.1 Sufficient Conditions on 7r Ensuring Finite Time Bounds on Bias and 
MSE 


We formulate a sufficient condition on 7 r that ensure that Theorem 5.2 is applicable. This 
hinges on deriving a sufficient condition for Equations (62), (64) and (65). The aim of 


this section is to establish and motivate the sufficient condition formulated in the following 
theorem. 


Theorem 9.1. Suppose the following condition holds 


{9, Vlog 7 r o ( 0 )) < -a 
(9, Vlog 7 r(Xj | 9)) < -a 


+ P 

+ P for i = 1 , 


,N. 


(A.l) 

(A.2) 


then Theorem 5.2 is applicable for polynomially bounded and continuous <f, that is 


Bias 


E 


K, = 
2 


E(j) K ~ & 




K 


£ c{h ‘ + ~k 


First we appeal to a sufficient condition for Equation(|64[) before summarising a regularity 
results of |Pardoux and Veretennikov, 2001 which allows us to establish Equation ( [62]) . We 
believe that the sufficient conditions above can be weakened, but this requires improving the 
results of |Pardoux and Veretennikov, 2001 which is out of the scope of this article. 

The condition sup fc EV p (9k) < oo (that is Equation (64)) is established for all p < p* by 
Lemma 5 in |Teh et al., 2014| if p* satisfies the following assumption. 

Assumption 9.1. The drift term 9 H > ^ Vlog 7 r($) is continuous. The function V : —> 
[l,oo) tends to infinity as || 0 || oo, is twice differentiable with bounded second derivatives 
and satisfies the following conditions: 

1. V is a Lyapunov function for the Langevin dynamics, i.e. there are constants a,P > 0 
such that for every 9 G we have 


(vV(9),^Vlogir(9)^ <-aV(9) + p. 


(A.3) 


2. The following bounds hold 

• There exists an exponent pu > 2 such that 

E(\\H(9,U)\\ 2ph ]<V Ph (9). (A.4) 

Moreover, this implies that E[ \\H(9,U)\\ 2p ] < V p (9) for any exponent 0 < p < pu- 

• For every 9 G we have 

l|VE(0 )|| 2 + ||Vlogvr( 6 >)|| 2 < V(9). (A.5) 
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Notice that for / based on subsampling we obtain that pu = oo if 


|Vlogp(y \QW<C{y)V. 


Notice that Assumption 9.1 also implies 


E ||4+i ~9 k tf»<V* 


E 


fk 


2 P 


<v p k 


(A.6) 


if for p < ph■ Equation (65) could now simply be formulated as an additional assumption, 
however currently we need even stronger assumptions to verify Equation (62). Subsequently, 
we show how the result s of J Pard oux a nd Verete nnikov, 200T] can be used to establish Equa¬ 
tion (62) if Equations (A.3) and (A.5[) hold for V = ||#|| J + 1. 

Theorem 1 and 2 of Pardoux and Veretennikov, 2001 characterise the smoothness and 


growth of the solution to the Poisson equation associated with Equation (60). This is impor¬ 
tant for our results because the key ingredient for the proof of Theorem 5.2 is 

v> (W) m 


sup E 
k 


< oo for k = 1,..., 4. 


Because we have already established in Section [4] that 
sup,; E V p ( 0i ) < oo it is left to verify 




<V p ^ k , for A; = 0,..., 4 


The assumptions needed to apply the Theorem 5.2 results are 

e 


m, 


■ - r 


|| 0 || > M 0 


0<A_ < (ff(0)5*(fl)p||-,p|i ) < A+ <oo. 


(J62j) 

(A.7) 

(A.8) 


This holds if Assumption 9.1 is satisfied with V (6) = \\6\\ z + 1. 


Theorem 9.2. J Pardoux and Veretennikov, 200 11 Let f = 0 and Equations (A.7) and (A.8) 
are satisfied. Then there exists a solution 6 Wf oc to the Poisson equation 

Ci/j = 4> — 4> 

1. If there is a C such that 

<<7(1 + 

for some ft < 0, then u is bounded. Moreover, 


and 


sup |(0)| < C sup |/| (1 + 
e e 


IIW’II < c 


-p 


here 
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2. if there exists a constant C and some (3 > 0 such that 


\m\<ca + mf 

then there exist a constant such that 

\if(o)\<c'(i + \\e\\f. 

Finally there exists C such that 

IIW-II <<7(1 + 101*). 


Remark 9.3. We believe that assumption Theorem 9.2 can be weakened to be of the form of 
Equation (A.3) but it is out of the scope of this article to explore this direction. 


In order to iterate Theorem 9.2 we note that the derivatives if can be expressed as solution 
to Poisson equations with different RHSs. 


Cif 

= 

Adiif 

= dif>- 

Adi j if 

= dij(f - 

Adijkif 

dj.jkf 


2 V ^ 


(A.9) 

(A.10) 

(A.ll) 


~^d k if ■ d jk f - ^7 if ■ d jk f - ^Vif • d k f 


We will denote by f3^^ numbers that satisfy 


sup | 

\a\=i 


< 

r^j 


(l + \d\^> 


(A.13) 


where we used multi-index notation for derivatives. We use a similar notation for the deriva¬ 
tives of /, that is /3fj and assume that these bounds are a priori given. 

Using Theorem |9.2| we can obtain p to satisfy Equation (62) in terms of the /3’s which 
we formulate as the following lemma. 


Lemma 9.4. Suppose that <f and its derivatives are bounded and Assumption 9.1 and Equa¬ 
tions Equations (A. 1) and (A. 8) hold. Then the choice 


PV>,o — 

pvt = 

Pi’, 2 = 
Pi>, 3 = 
Pip ,4 


0 

0 

ft /,1 

2 

R v ^’ 2 

Df.l V -y 

o V 1 + ^ Df,3*) 


satisfies Equation (62). 
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Proof. Assumption 9.1 yields that /3 /q = 1 is a valid choice. Applying Theorem 9.2 


to 


Equation (A.9) implies that /^o := := 0 satisfies Equation (A.13). Applying Theorem 

A/>,2 <Pf, l- 


9.2 to Equation (A.10) yields that 


Applying Theorem 9.2 to Equation (A.ll) yields that 

Pip,3 < 2 /3/,i V /3/ j2 - 


Applying Theorem 9.2 to Equation (A. 12) yields that 


/5y>,4 < (/5/,i + (2/3/,i V /3/,2-)) v (Pf, l + /?/,2-) V /?/,3 
< 3/3/,i V (/3/p + /3/,2 -) V ^/,3- 


Thus we have established Equation 


□ 


10 Analytic expressions for the Gaussian Toy Model 

We sketch the derivation of the analytic expression of the MSE are used for the plots in 
Section [6l 


10.1 Expected MSE for fixed Data Sets 

We outline how an analytic expression for the MSE of the sample average can be derived. 
The following method generalises to any polynomial test function but we concentrate on the 
sample average for the second moment of the posterior given by S 2 = X+lo 1 @j - Its MSE 
can be expressed using Equation (12) 


MSE = E ( 

\ M 


M -1 \ z 

E + a Pn = E5 2 - 2ES 2 + a 2 ) + {fi 2 p + a 2 ) 2 . (C.l) 

k =0 / 


In order to express in Equation (C.l) we derive the recurrence equations for E 9) for 


i = 1, 


, 4 by taking the expectations of 


9j +1 = (1 — A h)9j + hBj + Vhr)j 

9 2 +l = (1 - A h) 2 9 2 + h 2 B 2 + hr] 2 + 2(1 -A h)9jhBj + 2(1 -A h)9jVhr]j + 2hB j Vh.r ]j 

(C.2) 

9j +1 = ^(1 — A h)9j + hBj + Vhrjj\ 

= (1 - A h) 3 9j +3(1 -A h) 2 9 2 hBj + 3(1 -A h)9jh 2 B 2 + h 3 B 3 
+ 3Vhr] (...) + 3r/ 2 h ((1 — Ah)9j + hBj ) + rfhz 
9j +1 = (1 - A h) 4 9j +4(1 -A h) 3 9 3 hBj + 6(1 -A h) 2 9 2 h 2 B 2 
+ 4(1 -A h)9jh 3 B 3 + h 4 B 4 + 4 7 / (...) + 4? ? 3 (...) 

+ 6 hr] 2 ((1 — Ah) 2 9 2 + 2(1 — Ah)9 j hB j + h 2 B 2 ) + 7 fh 2 . 
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(C.3) 

(C.4) 

(C.5) 














The recurrent equation for Kdj is linear and first order and can therefore be solved explicitly. 
Plugging the result into the equation for K6 2 turns it into a first order linear equation as 
well. Repeating this processes yields explicit expressions for E 0 l - i = 1,... ,4.The sums can 
be carried out explicitly because the terms are of the form of a geometric sum or a geometric 
term with a polynomial factor. This allows us to obtain an analytic expression for ES^ by 
reducing it to E0* as follows 


= m 


The cross terms can be removed using Equation (C.2) so that 


/ M -1 M -1 M—i \ 

E^E"? E «? 

y i=0 i=0 j=i+1 ) 


e 2 = (i - Ah) 2{j - i] e 2 + J2 (i -Ah) 2k 

k=0 

\h 2 B 2 _ l _ k + h^rj 2 _ l _ k + 2 (1 — Ah) Bj_i_ k h9j_i_ k + 
+2 (1 Ah) fjj—\— k 9j—\— k \/~h T 2jh 2 Bj_\_ k Tjj_\_ k 


Plugging this into Equation (C.6) yields 


(C.6) 


(C.7) 


M—l / M—l \ 

E5 2 = 1 + 2 E c 1 - Ah f ^ l) 

i =0 y j=i +1 / 

M—l M—l j-l-i 

+Ejg2 E»? E E 0-Ahf k 

i= 0 .7=1+1 k= 0 

[h 2 B 2 _i_ k + hr) 2 _i_ k + 2(1 — Ah) Bj_i_ k h6j_\_ k 
+2 (1 Ah) f]j_\_ k '\/~h + 2/i 2 Bj_i_ k ijj_i_ k 

Using the recurrence Equation to express 9j_i_ k in terms of (9,; we conclude that ES'! is equal 
to 

M—l / M—l \ 

w Eb* i+2 E d-M+M 

i=0 \ j=i +1 / 

M—l M—l j-l-i 

+ m 2 E E Ed- AKf [hW + h}. 

i= 0 j=i+ 1 fc=0 
M—l M—l j-l-i 

+ e ^E 0 * 2 E E 

i=0 j=i+1 k=0 

(j-l-k)-i-l 

(1 _ Ah )+ £ (1 _ 

1=0 
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Taking the expectations into the sum yields 


E Si = 


M -1 / M—l \ 

pE'diu E 

i=0 \ j=i +1 / 


^ M—l M—l j—1—i 

+M2 E E E (! - Ah ? k N eb2 + ft l ■ 

i= 0 j=i+l fc=0 

M—l M—l j'-l-i 

+ e I7*E E0? E E a - Ahf k 2(1- Ah) Ee/i(l - 


i=0 j=i+l fc=0 
M—l M—l j—1— i 




+ Ef2 E W i EE 1 - 2 E Bh Y. (! - 


i=0 j=i+l k =0 


We have an expressions for E B but in the following we derive the expressions for E B 2 required 
to express E5|. The terms E£> :i and E B 4 are needed for the derivation of E dj. In order to 
derive expressions for E B p , we introduce the power sums 


EE 


and the elementary symmetric polynomials 


e o — 1) e i — 'y ] X tl e 2 — Y2i<i<j<N XiXj„ ..., ei\r — X j. (C.8) 

i =1 z=l 

Computing e* naively has complexity of order O (-/V*) for % <C IV. Using Newton’s identities 

ei = pi, e 2 = ^ (eipi - p 2 ), e 3 = ^ {-eip 2 + e- 2 pi + P3), e 4 = ^ (eip 3 - e 2 p 2 + e 3 pi - p 4 ) 

e* can be expressed in terms of pk, k < i each of which can be computed with complexity of 
order O (N). 

AT V n X 

We consider the term B = — Ti where r,; are sampled with replacement from a fixed 

data set {1,..., IV}. The second moment can be calculated as follows 


E B 2 = 


E E^Xr 


n(n — 1) KX Tl X T2 +nET Tl T r 


Mom 2 1 Mom 2 2 


We use Newton’s identities to express Mom 2j i and Mom 2)2 


M ° m2 ’ 1 = N(N~ 1) M ° m 2,2 = ^- 
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Similarly, we obtain 


E B 6 = 




Tk 


N \ 3 
n2al) 


i,j,k 

( 


Mom3 : i 

Moms^ 


n(n — 1 )(n — 2) EX T1 X T2 X T3 +3n(n — 1) EX T1 X^ 2 +n EX^ 

^ V ^ ^-S/-* / v —^ 

Mom3,i Mom3 t 2 Mom3 ? 3y 

'Yhi^j^i XiXjX t — Q e3 


\ 


N(N — l)(N — 2) N(N — 1)(N — 2) 

P1P2-P3 ,, P3 

N(N - 1) ' M ° m3 - 3 =iV' 


A similar calculation yields a representation of ER 4 in terms of pi,... ,p 4 . 

10.2 Expected MSE for Random Data 

We sketch the derivation of the MSE 

M—l \ 2 


Ee ’ x ( « 0*P + 


o': 


k=0 


where we take expectation with respect to X j A f (6\ (Jx) for i = 1,..., N and the 

randomness in the recursion for 6j . We obtain an analytic expression for the MSE by deriving 
expressions for Ex.o&j- We illustrate the computation for p = 2, noting that we assume 
do = 0 a.s.. We know that 


3 - 1 _ 

8j = (1 — Ah) h, 2 + hrjj_i_k + 2 (1 — Ah) rij—i—kOj—i—kVh 

k =o 


(C.9) 


3 

+2hzBj-i-kVj-i-k + 2(1 Ah) Bj—\— k hdj—\— k 

i-i _ 

= ^ ] (1 — Ah) h 2 Bj_i_ k + 2 (1 — Ah) r]j-i-kOj-i-kVh + 2hz Bj—i-krjj—i—hCAO) 

k =o 

i-i-fc-i 

hVj-i- k + 2 (1 - Ah) B j _i_f t h ^ (1 - Ah) 1 [hBj_ k _ 2 -i + VhVj-k-2-ijCAl) 

1=0 

The expectation Ex,e Oj therefore boils down to calculating ER 2 , ER R 7 and ER where R 
and B' are independent samples of Equation (15). We start by calculating 


ERR' = 


N 1 


n 2 4o-4 

x »=i j=i 


EEe-w-v 




n n / 1 

EE(* 


AT- 1. 


„ . , , . —EX + 

n 2 4a£ N 

x i=i j=i 


-EX A 


- j£4(E“,lE”=+(» t2 + +) + A 1 < ,,2 )' 
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Similarly, we obtain 


EB 2 


N 2 n(6^ 2 + <Tg) + n(n — 1)6^ 2 
n 2 4cr^ 1 


Deriving E Q v - for p = 1 , 2 ,3,4 requires the calculation of EB® 1 B^ 2 B^ 3 B^ 4 where Bi are i.i.d. 
following the distribution of Equation (15) for ctj > 0 and a i — 4- The arguments so far 
allow us to derive T± and X 3 in 


MSE = E(S 2 -2S 2 (p 2 p + a 2 p ) + (p 2 p + a 2 p ) 2 ) 

= MS' 2 —2 ES 2 p^ —2 ES 2 (Jp + E (p p + 2 p^cr 2 + 0 ^) . 
T t 2 t 3 t 4 

Recall that the posterior for this toy model is given by 


Af(p p ,a 2 ) 


£ 2=1 Xi 


A f 


+ N 


1 N 

2 T 9 

^2 


and hence T 4 can be computed explicitly. The summands of T 2 can 

and EBB' p 2 . 
file. 


2,,2 


Equation ( |C.9[ ) in terms of the quantities EBp p , E B z p 


can be obtained from the supplemented Mathematical 


be derived similarly to 
The explicit expression 
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